# Multimodal processing

Gemma 3n E2B It Unsloth Bnb 4bit

Gemma 3n-E2B-it is a lightweight open-source multimodal model launched by Google, built on the same technology as Gemini and optimized for low-resource devices.

Transformers English

Gemma 3n is a lightweight, state-of-the-art open-source model family launched by Google, supporting multimodal input and output.

Gemma 3n E4B It

Gemma 3n is a lightweight and state-of-the-art open-source multimodal model family launched by Google. It is built on the same research and technology as the Gemini model and supports text, audio, and visual inputs.

Nuextract 2.0 4B

NuExtract 2.0 is a series of multimodal models specifically trained for structured information extraction tasks. It supports text and image inputs and has multilingual processing capabilities.

Google.gemma 3 4b It Qat Int4 Unquantized GGUF

A quantized version of the image-to-text model based on Gemma 3 4B, aiming to make knowledge accessible to the public.

Gemma 3 4b It Qat Autoawq

Gemma 3 is a lightweight open-source multimodal model launched by Google, built on Gemini technology, supporting text and image input and generating text output.

Smoldocling 256M Preview Mlx Fp16

This model is converted from ds4sd/SmolDocling-256M-preview to the MLX format, supporting image-text-to-text tasks.

Transformers English

Gemma 3 27b Pt Bnb 4bit

Gemma 3 is a lightweight open model series launched by Google, built on the same research and technology as the Gemini model, supporting multimodal input and text output.

Transformers English

Gemma 3 1b Pt Unsloth Bnb 4bit

Gemma 3 is a series of lightweight open models launched by Google, supporting multimodal input (text and images) and a large 128K context window, suitable for tasks such as question answering and summarization.

Transformers English

Kaleidoscope Large V1

A specialized document Q&A model fine-tuned from sberbank-ai/ruBert-large, supporting document Q&A tasks in Russian and English.

Question Answering System

Transformers Supports Multiple Languages

Kaleidoscope Large V1

A document QA model fine-tuned from sberbank-ai/ruBert-large, excelling at extracting answers from documents, supporting Russian and English.

Question Answering System

Transformers Supports Multiple Languages

Kaleidoscope Small V1

A document question-answering model fine-tuned from sberbank-ai/ruBert-base, excelling at extracting answers from document contexts, supporting Russian and English.

Question Answering System

Transformers Supports Multiple Languages

Ola-7B is a multimodal language model jointly developed by Tencent, Tsinghua University, and Nanyang Technological University, based on the Qwen2.5 architecture. It supports processing image, video, audio, and text inputs and outputs text.

Multimodal Fusion

Safetensors Supports Multiple Languages

This model converts PDF documents into Markdown format while preserving the original document layout structure and accurately recognizing mathematical formulas and tables.

Transformers Supports Multiple Languages

Pixtral 12b Nf4

A 4-bit quantized version of the Mistral community's Pixtral-12B, focusing on image-text-to-text tasks and supporting Chinese description generation.

Florence 2 DocVQA

This is a version of Microsoft's Florence-2 model fine-tuned for one day on 5% of the Docmatix dataset with a learning rate of 1e-6.

Kosmos 2 PokemonCards Trl Merged

This is a multimodal model fine-tuned from Microsoft's Kosmos-2 model, specifically designed for recognizing Pokemon names on Pokemon cards.

Transformers English

A cell segmentation model developed by the Sribd-med team, suitable for cell instance segmentation tasks in multimodal images.

Image Segmentation

Transformers English

Donut Base Finetuned Latvian Receipts V2

A model based on the Donut architecture, specifically fine-tuned on Latvian receipt data.

Text Recognition

S2t Small Mustc En De St

A speech-to-text Transformer model trained for end-to-end English-to-German speech translation.

Speech Recognition

Transformers Supports Multiple Languages

S2t Small Mustc En Ro St

A Transformer-based end-to-end speech translation model designed for English-to-Romanian speech translation.

Speech Recognition

Transformers Supports Multiple Languages

S2t Small Mustc En Fr St

An end-to-end English-to-French speech translation model based on the S2T architecture, trained on the MuST-C dataset.

Speech Recognition

Transformers Supports Multiple Languages

© 2025 AIbase